Analysing F1 Radio Transcripts

Web & Data Science — Final Project

Thilo Hohl, 2093510

1. Ideation, Research and Sourcing

In this project, I aim to analyse and compare the 2013–2017 Formula 1 seasons and the teams competing using the processes and techniques acquired in this semesters’ lessons.

Upon forming the idea, I quickly found the article “Formula 1 Team Radio: A text-mining use case” by Marcell Ignéczi, which already touched upon most of the things I’d thought of, but only looked at the 2017 season and only sparingly correlated findings with actual events. I am hoping to go into deeper detail and also compare multiple seasons and the teams’ differences between those years.

Ignéczi’s source, radio broadcast transcripts from racefans.net, also proved to be the most complete and valuable collection of their kind, since, even though full team radios are broadcast in F1TV, this service is unavailable in Germany and I was unable to find any recordings.

racefans.net only transcribed the radio messages contained in the Liberty Media Formula One Broadcast, so a selection of highlights, not everything sent between drivers and teams. Still, since this data exists for every Grand Prix in between the 2013 and 2017, I am hopeful to find statistically significant observations.

2. Data Crawling and Formatting

Driver Standings

For context, we’ll also need the official driver standings for each year.

Formula1.com, the competition’s official web page, provides this data.

data_frames = list()
alldrivers = list()

for (year in 2013:2017) {
  domain = read_html(paste("https://www.formula1.com/en/results.html/",year,"/drivers.html", sep = ""))  
  table = html_nodes(domain, "table")
  data = html_table(table, fill = TRUE)[[1]]
  
  # Pick the columns we need
  data = data[,2:6]
  data = select(data, -c(Nationality))
  
  # Remove Tabs, newline characters, trailing spaces and clean up driver abbreviations
  data$Driver = str_replace_all(data$Driver, "[\\t \\n]+", " ")
  data$Driver = str_replace_all(data$Driver, "[A-Z]{3}", "")
  data$Driver = str_trim(data$Driver)
  data$Driver = str_replace_all(data$Driver, "Carlos Sainz", "Carlos Sainz Jnr")
  data$Driver = str_replace_all(data$Driver, "Kimi Räikkönen", "Kimi Raikkonen")
  
  for (driver in data$Driver) {
    if(!(driver %in% alldrivers)) {
      alldrivers = c(alldrivers, driver)
    }    
  }

  
  data$Year = year
  
  var_name = paste0("driverstandings", year)
  assign(var_name, data, envir = .GlobalEnv)
  data_frames[[year-2012]] = get(var_name, envir = .GlobalEnv)

  csv_name = paste0("driverstandings", year, ".csv")
  write.csv(data_frames[[year-2012]], csv_name, row.names = F)
}

Now, we’ve got a list of drivers and teams as well as the seasons’ results.

Let’s look at 2017’s standings:

driverstandings2013 = read.csv("driverstandings2013.csv")
driverstandings2014 = read.csv("driverstandings2014.csv")
driverstandings2015 = read.csv("driverstandings2015.csv")
driverstandings2016 = read.csv("driverstandings2016.csv")
driverstandings2017 = read.csv("driverstandings2017.csv")
kable(driverstandings2017)
Pos Driver Car PTS Year
1 Lewis Hamilton Mercedes 363 2017
2 Sebastian Vettel Ferrari 317 2017
3 Valtteri Bottas Mercedes 305 2017
4 Kimi Raikkonen Ferrari 205 2017
5 Daniel Ricciardo Red Bull Racing TAG Heuer 200 2017
6 Max Verstappen Red Bull Racing TAG Heuer 168 2017
7 Sergio Perez Force India Mercedes 100 2017
8 Esteban Ocon Force India Mercedes 87 2017
9 Carlos Sainz Jnr Renault 54 2017
10 Nico Hulkenberg Renault 43 2017
11 Felipe Massa Williams Mercedes 43 2017
12 Lance Stroll Williams Mercedes 40 2017
13 Romain Grosjean Haas Ferrari 28 2017
14 Kevin Magnussen Haas Ferrari 19 2017
15 Fernando Alonso McLaren Honda 17 2017
16 Stoffel Vandoorne McLaren Honda 13 2017
17 Jolyon Palmer Renault 8 2017
18 Pascal Wehrlein Sauber Ferrari 5 2017
19 Daniil Kvyat Toro Rosso 5 2017
20 Marcus Ericsson Sauber Ferrari 0 2017
21 Pierre Gasly Toro Rosso 0 2017
22 Antonio Giovinazzi Sauber Ferrari 0 2017
23 Brendon Hartley Toro Rosso 0 2017

Radio Messages

The data exists as separate articles on racefans.net for each Grand Prix, containing a few paragraphs of text and then, conveniently, a table containing lap, radio message destination (to or from certain drivers) and Message Text.

Links to our data look like this, so it’s hard to find a methodology to crawl them, since we don’t know the date they were written:

https://www.racefans.net/2016/03/23/2016-australian-grand-prix-team-radio-transcript-2/

https://www.racefans.net/2017/09/08/2017-italian-grand-prix-team-radio-transcript/

Luckily, racefans.net is searchable. We’ll just search for the term 'grand prix team radio transcript' and collect the links to the articles that way.

domain = "https://www.racefans.net/?s=Grand+Prix+team+radio+transcript"

# first page
remDr$navigate(domain)
Sys.sleep(sleep) # give the page time to fully load
page = remDr$getPageSource()[[1]]

resultpage = read_html(page)

links = resultpage %>% 
    html_nodes('.entry-title') %>% 
    html_nodes('a') %>% 
    rvest::html_attr('href')

# later pages
for (index in 2:5) {
  remDr$navigate(paste("https://www.racefans.net/page/",index,"/?s=Grand+Prix+team+radio+transcript", sep=""))
  Sys.sleep(sleep)
  page = remDr$getPageSource()[[1]]
  resultpage = read_html(page)
  morelinks = resultpage %>%     html_nodes('.entry-title') %>%     html_nodes('a') %>%     rvest::html_attr('href')
  links = append(links, morelinks)
}

Formatting

Let’s clean up the links we don’t need. The first link and the last few are different articles, and the 2017 chinese grand prix transcript is corrupted.

links = links[2:86]
links = links[-17]
write.csv(links, "links.csv", row.names = F)
links = read.csv("links.csv")
print(links)
##                                                                                           x
## 1      https://www.racefans.net/2017/11/30/2017-abu-dhabi-grand-prix-team-radio-transcript/
## 2      https://www.racefans.net/2017/11/16/2017-brazilian-grand-prix-team-radio-transcript/
## 3        https://www.racefans.net/2017/11/02/2017-mexican-grand-prix-team-radio-transcript/
## 4  https://www.racefans.net/2017/10/26/2017-united-states-grand-prix-team-radio-transcript/
## 5       https://www.racefans.net/2017/10/11/2017-japanese-grand-prix-team-radio-transcript/
## 6      https://www.racefans.net/2017/10/04/2017-malaysian-grand-prix-team-radio-transcript/
## 7      https://www.racefans.net/2017/09/25/2017-singapore-grand-prix-team-radio-transcript/
## 8        https://www.racefans.net/2017/09/08/2017-italian-grand-prix-team-radio-transcript/
## 9        https://www.racefans.net/2017/08/31/2017-belgian-grand-prix-team-radio-transcript/
## 10     https://www.racefans.net/2017/08/04/2017-hungarian-grand-prix-team-radio-transcript/
## 11       https://www.racefans.net/2017/07/19/2017-british-grand-prix-team-radio-transcript/
## 12    https://www.racefans.net/2017/06/30/2017-azerbaijan-grand-prix-team-radio-transcript/
## 13      https://www.racefans.net/2017/06/14/2017-canadian-grand-prix-team-radio-transcript/
## 14        https://www.racefans.net/2017/06/02/2017-monaco-grand-prix-team-radio-transcript/
## 15       https://www.racefans.net/2017/05/05/2017-russian-grand-prix-team-radio-transcript/
## 16       https://www.racefans.net/2017/04/21/2017-bahrain-grand-prix-team-radio-transcript/
## 17    https://www.racefans.net/2017/03/31/2017-australian-grand-prix-team-radio-transcript/
## 18     https://www.racefans.net/2016/11/30/2016-abu-dhabi-grand-prix-team-radio-transcript/
## 19     https://www.racefans.net/2016/11/16/2016-brazilian-grand-prix-team-radio-transcript/
## 20       https://www.racefans.net/2016/11/01/2016-mexican-grand-prix-team-radio-transcript/
## 21 https://www.racefans.net/2016/10/26/2016-united-states-grand-prix-team-radio-transcript/
## 22      https://www.racefans.net/2016/10/17/2016-japanese-grand-prix-team-radio-transcript/
## 23     https://www.racefans.net/2016/10/05/2016-malaysian-grand-prix-team-radio-transcript/
## 24     https://www.racefans.net/2016/09/21/2016-singapore-grand-prix-team-radio-transcript/
## 25       https://www.racefans.net/2016/09/13/2016-italian-grand-prix-team-radio-transcript/
## 26       https://www.racefans.net/2016/09/13/2016-belgian-grand-prix-team-radio-transcript/
## 27        https://www.racefans.net/2016/08/03/2016-german-grand-prix-team-radio-transcript/
## 28     https://www.racefans.net/2016/08/03/2016-hungarian-grand-prix-team-radio-transcript/
## 29       https://www.racefans.net/2016/07/18/2016-british-grand-prix-team-radio-transcript/
## 30      https://www.racefans.net/2016/07/07/2016-austrian-grand-prix-team-radio-transcript/
## 31      https://www.racefans.net/2016/06/22/2016-european-grand-prix-team-radio-transcript/
## 32      https://www.racefans.net/2016/06/16/2016-canadian-grand-prix-team-radio-transcript/
## 33       https://www.racefans.net/2016/05/04/2016-russian-grand-prix-team-radio-transcript/
## 34       https://www.racefans.net/2016/04/22/2016-chinese-grand-prix-team-radio-transcript/
## 35       https://www.racefans.net/2016/04/06/2016-bahrain-grand-prix-team-radio-transcript/
## 36  https://www.racefans.net/2016/03/23/2016-australian-grand-prix-team-radio-transcript-2/
## 37     https://www.racefans.net/2015/12/01/2015-abu-dhabi-grand-prix-team-radio-transcript/
## 38     https://www.racefans.net/2015/11/17/2015-brazilian-grand-prix-team-radio-transcript/
## 39       https://www.racefans.net/2015/11/04/2015-mexican-grand-prix-team-radio-transcript/
## 40 https://www.racefans.net/2015/10/29/2015-united-states-grand-prix-team-radio-transcript/
## 41       https://www.racefans.net/2015/10/14/2015-russian-grand-prix-team-radio-transcript/
## 42      https://www.racefans.net/2015/09/30/2015-japanese-grand-prix-team-radio-transcript/
## 43     https://www.racefans.net/2015/09/23/2015-singapore-grand-prix-team-radio-transcript/
## 44       https://www.racefans.net/2015/09/09/2015-italian-grand-prix-team-radio-transcript/
## 45       https://www.racefans.net/2015/08/26/2015-belgian-grand-prix-team-radio-transcript/
## 46     https://www.racefans.net/2015/07/29/2015-hungarian-grand-prix-team-radio-transcript/
## 47       https://www.racefans.net/2015/07/08/2015-british-grand-prix-team-radio-transcript/
## 48      https://www.racefans.net/2015/06/25/2015-austrian-grand-prix-team-radio-transcript/
## 49      https://www.racefans.net/2015/06/10/2015-canadian-grand-prix-team-radio-transcript/
## 50        https://www.racefans.net/2015/05/27/2015-monaco-grand-prix-team-radio-transcript/
## 51       https://www.racefans.net/2015/05/13/2015-spanish-grand-prix-team-radio-transcript/
## 52       https://www.racefans.net/2015/04/24/2015-bahrain-grand-prix-team-radio-transcript/
## 53       https://www.racefans.net/2015/04/23/2015-chinese-grand-prix-team-radio-transcript/
## 54     https://www.racefans.net/2014/11/27/2014-abu-dhabi-grand-prix-team-radio-transcript/
## 55     https://www.racefans.net/2014/11/14/2014-brazilian-grand-prix-team-radio-transcript/
## 56 https://www.racefans.net/2014/11/05/2014-united-states-grand-prix-team-radio-transcript/
## 57       https://www.racefans.net/2014/10/15/2014-russian-grand-prix-team-radio-transcript/
## 58      https://www.racefans.net/2014/10/08/2014-japanese-grand-prix-team-radio-transcript/
## 59     https://www.racefans.net/2014/09/24/2014-singapore-grand-prix-team-radio-transcript/
## 60       https://www.racefans.net/2014/09/10/2014-italian-grand-prix-team-radio-transcript/
## 61       https://www.racefans.net/2014/08/31/2014-belgian-grand-prix-team-radio-transcript/
## 62   https://www.racefans.net/2014/07/30/2014-hungarian-grand-prix-team-radio-transcript-2/
## 63        https://www.racefans.net/2014/07/23/2014-german-grand-prix-team-radio-transcript/
## 64       https://www.racefans.net/2014/07/09/2014-british-grand-prix-team-radio-transcript/
## 65      https://www.racefans.net/2014/06/25/2014-austrian-grand-prix-team-radio-transcript/
## 66      https://www.racefans.net/2014/06/18/2014-canadian-grand-prix-team-radio-transcript/
## 67       https://www.racefans.net/2014/04/23/2014-chinese-grand-prix-team-radio-transcript/
## 68       https://www.racefans.net/2014/04/09/2014-bahrain-grand-prix-team-radio-transcript/
## 69     https://www.racefans.net/2014/04/02/2014-malaysian-grand-prix-team-radio-transcript/
## 70    https://www.racefans.net/2014/03/18/2014-australian-grand-prix-team-radio-transcript/
## 71     https://www.racefans.net/2013/11/26/2013-brazilian-grand-prix-team-radio-transcript/
## 72 https://www.racefans.net/2013/11/20/2013-united-states-grand-prix-team-radio-transcript/
## 73     https://www.racefans.net/2013/11/09/2013-abu-dhabi-grand-prix-team-radio-transcript/
## 74        https://www.racefans.net/2013/11/01/2013-indian-grand-prix-team-radio-transcript/
## 75    https://www.racefans.net/2013/10/17/2013-japanese-grand-prix-team-radio-transcript-2/
## 76        https://www.racefans.net/2013/10/09/2013-korean-grand-prix-team-radio-transcript/
## 77     https://www.racefans.net/2013/09/26/2013-singapore-grand-prix-team-radio-transcript/
## 78       https://www.racefans.net/2013/09/13/2013-italian-grand-prix-team-radio-transcript/
## 79       https://www.racefans.net/2013/08/29/2013-belgian-grand-prix-team-radio-transcript/
## 80     https://www.racefans.net/2013/08/02/2013-hungarian-grand-prix-team-radio-transcript/
## 81        https://www.racefans.net/2013/07/11/2013-german-grand-prix-team-radio-transcript/
## 82       https://www.racefans.net/2013/07/03/2013-british-grand-prix-team-radio-transcript/
## 83      https://www.racefans.net/2013/06/25/2013-canadian-grand-prix-team-radio-transcript/
## 84                   https://www.racefans.net/2013/05/29/2013-monaco-grand-prix-team-radio/

Now, we can extract the radio transmissions from our links.

We’ll need to account for two different table formats, as the page changed them slightly after 2015. We also have to filter out emphasized (<em>) text, which is context given by the author and not message content.

Season and Location are added from the header in two columns Year and GP.

radio = data.frame(matrix(ncol = 5, nrow = 0))
cols = c('`Lap**`', 'Driver', 'Message', 'Year', 'GP')
colnames(radio) = cols
for (link in 1:84) {#1:84) { 
  remDr$navigate(links[link, 1])
  Sys.sleep(sleep)
  page = remDr$getPageSource()[[1]]
  page = read_html(page)
  table = html_nodes(page, "table")
  if(link == 27) { dataframe = html_table(table, fill = TRUE)[[2]] }
  else { dataframe = html_table(table, fill = TRUE)[[1]] }
  
  # Format: Lap | Driver | Message
  title = html_nodes(page, '.entry-title') %>% html_text  
  dataframe$Year = str_extract(title, "\\b\\w+")
  dataframe$GP = unlist(str_extract_all(title, "(?<=\\d\\d\\d\\d )[\\w ]+(?= Grand)"))
  
  # Format: Lap | To | From | Message | +Year | +GP
  if (dataframe$Year[1] < 2016) {
  
    for (row in 1:nrow(dataframe)) {
      if (dataframe$To[row] %in% alldrivers) {
          dataframe$Driver[row] = paste("To",dataframe$To[row], sep=" ")
      } else {
          dataframe$Driver[row] = paste("From",dataframe$From[row], sep=" ")
      }
    }
    dataframe = select(dataframe, -c(To, From))
  }
  radio = rbind(radio, dataframe)
  write.csv(radio, "radio.csv", row.names = F)
}
pradio = read.csv("radio.csv")

Now we’ve got our data sorted by year and Grand Prix in a dataframe.

Let’s add a column for Team and Driver name, since that’s not in our dataframe yet. We’ll also add the column Dir, indicating the direction of the message (T = to and F = from the Driver).

# Extract dir and Name from Driver
pradio$Driver_Name = str_replace_all(pradio$Driver, "(To )*(From )*", "")
pradio$Dir = substring(pradio$Driver, 1,1)
pradio$Driver = pradio$Driver_Name
pradio = select(pradio, -Driver_Name)

# Add team by Year and Driver
driverstandings = bind_rows(
  driverstandings2013, driverstandings2014, driverstandings2015,  driverstandings2016, driverstandings2017
)
pradio = pradio %>%
  left_join(driverstandings, by = c("Driver", "Year")) %>%
  select(Driver, Dir, Team = Car, Message, Lap = Lap., GP, Year, DPos = Pos)

# Merge Team Names 
pradio$Team = str_replace_all(pradio$Team, "Red Bull Racing TAG Heuer", "Red Bull")
pradio$Team = str_replace_all(pradio$Team, "Red Bull Racing Renault", "Red Bull")
pradio$Team = str_replace_all(pradio$Team, "McLaren Mercedes", "McLaren")
pradio$Team = str_replace_all(pradio$Team, "McLaren Honda", "McLaren")
pradio$Team = str_replace_all(pradio$Team, "Lotus Mercedes", "Lotus")
pradio$Team = str_replace_all(pradio$Team, "Lotus Renault", "Lotus")
pradio$Team = str_replace_all(pradio$Team, "Haas Ferrari", "Haas")
pradio$Team = str_replace_all(pradio$Team, "Force India Mercedes", "Force India")
pradio$Team = str_replace_all(pradio$Team, "STR Renault", "Toro Rosso")
pradio$Team = str_replace_all(pradio$Team, "STR Ferrari", "Toro Rosso")
pradio$Team = str_replace_all(pradio$Team, "Toro Rosso Ferrari", "Toro Rosso")
pradio$Team = str_replace_all(pradio$Team, "Williams Mercedes", "Williams")
pradio$Team = str_replace_all(pradio$Team, "Williams Renault", "Williams")
pradio$Team = str_replace_all(pradio$Team, "Caterham Renault", "Caterham")
pradio$Team = str_replace_all(pradio$Team, "Marussia Ferrari", "Marussia/Manor")
pradio$Team = str_replace_all(pradio$Team, "Marussia Cosworth", "Marussia/Manor")
pradio$Team = str_replace_all(pradio$Team, "MRT Mercedes", "Marussia/Manor")
pradio$Team = str_replace_all(pradio$Team, "Sauber Ferrari", "Sauber")

driverstandings$Car = str_replace_all(driverstandings$Car, "Red Bull Racing TAG Heuer", "Red Bull")
driverstandings$Car = str_replace_all(driverstandings$Car, "Red Bull Racing Renault", "Red Bull")
driverstandings$Car = str_replace_all(driverstandings$Car, "McLaren Mercedes", "McLaren")
driverstandings$Car = str_replace_all(driverstandings$Car, "McLaren Honda", "McLaren")
driverstandings$Car = str_replace_all(driverstandings$Car, "Lotus Mercedes", "Lotus")
driverstandings$Car = str_replace_all(driverstandings$Car, "Lotus Renault", "Lotus")
driverstandings$Car = str_replace_all(driverstandings$Car, "Haas Ferrari", "Haas")
driverstandings$Car = str_replace_all(driverstandings$Car, "Force India Mercedes", "Force India")
driverstandings$Car = str_replace_all(driverstandings$Car, "STR Renault", "Toro Rosso")
driverstandings$Car = str_replace_all(driverstandings$Car, "STR Ferrari", "Toro Rosso")
driverstandings$Car = str_replace_all(driverstandings$Car, "Toro Rosso Ferrari", "Toro Rosso")
driverstandings$Car = str_replace_all(driverstandings$Car, "Williams Mercedes", "Williams")
driverstandings$Car = str_replace_all(driverstandings$Car, "Williams Renault", "Williams")
driverstandings$Car = str_replace_all(driverstandings$Car, "Caterham Renault", "Caterham")
driverstandings$Car = str_replace_all(driverstandings$Car, "Marussia Ferrari", "Marussia/Manor")
driverstandings$Car = str_replace_all(driverstandings$Car, "Marussia Cosworth", "Marussia/Manor")
driverstandings$Car = str_replace_all(driverstandings$Car, "MRT Mercedes", "Marussia/Manor")
driverstandings$Car = str_replace_all(driverstandings$Car, "Sauber Ferrari", "Sauber")

teamColors = c(
  "Mercedes" = "grey",
  "Haas" = "black",
  "McLaren" = "orange",
  "Ferrari" = "red",
  "Force India" = "pink",
  "Renault" = "yellow",
  "Williams" = "skyblue",
  "Red Bull" = "darkblue",
  "Toro Rosso" = "blue",
  "Sauber" = "darkred",
  "Marussia/Manor" = "#ff5a5e",
  "Lotus" = "gold",
  "Caterham" = "darkgreen"
)

Our data is formatted and finished! Let’s look at some samples:

kable(pradio[sample(nrow(pradio), 5), ])
Driver Dir Team Message Lap GP Year DPos
3325 Max Verstappen T Red Bull Well done. As expected. You did a really solid job. VL Australian 2017 6
15545 Felipe Massa T Ferrari Understood: ‘Bent’. 24 Canadian 2013 8
5081 Daniel Ricciardo T Red Bull Okay mate, it’s time to push now. 25 Bahrain 2016 3
7767 Carlos Sainz Jnr F Toro Rosso Seven. 25 Austrian 2015 15
15147 Kimi Raikkonen T Lotus We understood that tyres are not bad. 47 German 2013 5

The Codes in the Lap column mean:

  • PR = pre-race
  • FL = formation lap
  • VL = victory lap,

while a number equals the lap number this radio transmission was played (meaning sent, in most cases) during the race.

Slicing Data

For ease of use, we’ll divide pradio into different sizes based on the number of messages sent to the driver.

numMessages = table(unlist(pradio$Driver))
sortedNumMessages = sort(numMessages, decreasing = TRUE)
topNames = as.data.frame(sortedNumMessages[1:39])

# Top 39 Drivers
pradioTop = subset(pradio, Driver %in% topNames$Var1)

# Top 20 Drivers
pradioTop20 = subset(pradio, Driver %in% topNames$Var1[1:20])

# Top Drivers
pradioTop10 = subset(pradio, Driver %in% topNames$Var1[1:10])
The data is distributed as follows:

pradioTop10
10,301 (64,5%)

pradioTop20
14,158 (88,6%)

pradioTop
15,875 (99,4%)


3. Data Exploration and Visualization

Now we’ve got 15,962 Messages to play with.

For general context, here’s the driver’s championship points plotted over the years in our data:

ds10 = subset(driverstandings, Driver %in% topNames$Var1[1:10])

ggplot(ds10, aes(x = Year, y = PTS, group = Driver, color = Car, label = Driver)) +
  geom_line(linewidth = 2, lineend = "round") +
  geom_text(nudge_x = 0, vjust = -0.5, size = 3) +
  labs(title = "Driver Points by Year", x = "Year", y = "Driver Points") +
  scale_color_manual(values = teamColors) +
  theme_minimal() +
  theme(legend.position = "bottom")

Let’s add wordLength to our dataframe for some further analysis.

pradio$wordLength = str_count(pradio$Message, "\\w+")

# Update Subsets
pradioTop = subset(pradio, Driver %in% topNames$Var1)
pradioTop20 = subset(pradio, Driver %in% topNames$Var1[1:20])
pradioTop10 = subset(pradio, Driver %in% topNames$Var1[1:10])

Descriptive statistics are not too descriptive in this case:

summary(pradio)
##     Driver              Dir                Team             Message         
##  Length:15962       Length:15962       Length:15962       Length:15962      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##      Lap                 GP                 Year           DPos      
##  Length:15962       Length:15962       Min.   :2013   Min.   : 1.00  
##  Class :character   Class :character   1st Qu.:2014   1st Qu.: 3.00  
##  Mode  :character   Mode  :character   Median :2015   Median : 7.00  
##                                        Mean   :2015   Mean   : 8.01  
##                                        3rd Qu.:2016   3rd Qu.:12.00  
##                                        Max.   :2017   Max.   :24.00  
##                                                       NA's   :85     
##    wordLength    
##  Min.   :  1.00  
##  1st Qu.:  6.00  
##  Median : 11.00  
##  Mean   : 14.36  
##  3rd Qu.: 19.00  
##  Max.   :150.00  
##  NA's   :1

Plots

Message Distribution

ggplot(topNames[1:10,], aes(x=Var1, y=Freq)) +
  geom_bar(stat="identity", fill="steelblue")+
  geom_text(aes(label=Freq), vjust=1.6, color="white", size=5)+
  labs(x="Driver", y = "Number of Messages", title = "Message Distribution (Top 10 Drivers)") +
  scale_color_manual(teamColors) +
  theme_minimal()

numMessages = table(unlist(pradio$Team))
sortedNumMessages = sort(numMessages, decreasing = TRUE)
topTeams = as.data.frame(sortedNumMessages)

ggplot(topTeams[1:10,], aes(x=Var1, y=Freq, fill=Var1)) +
  geom_bar(stat="identity")+
  geom_text(aes(label=Freq), vjust=1.6, color="white", size=5)+
  labs(x="Team", y = "Number of Messages", title = "Message Distribution (Top 10 Teams)") +
  scale_fill_manual(values=teamColors) +
  theme_minimal() +
  theme(legend.position = "none") 

To-From Distribution for drivers

teamm = nrow(subset(pradio, Dir == "T"))
driverm = nrow(subset(pradio, Dir == "F"))
prop = c(teamm, driverm)
pie(prop, labels = c(paste("Team to Driver (", teamm, ")", sep=""), paste("Driver to Team (", driverm, ")", sep="")), border="white", col=c("steelblue", "darkgrey"))

Naturally, the team feeds more information and guidance to the drivers, as they’ve got the data and the tactics in front of them.
We’re looking at 66% team messages and 34% driver radios.
pradio$Year = factor(pradio$Year)
words_year = pradio %>% group_by(Year) %>% summarize(wordLength)

ggplot(words_year, aes(x = Year, y = wordLength, fill = Year)) +
  geom_boxplot(outlier.color="black", outlier.shape=16, outlier.size = 2) +
  labs(title = "Average Message Length (Words)", x = "Year", y = "Average N. of Words per Message") +
  scale_y_log10() +
  theme_minimal()

Over the years, messages have become shorter.

Correlations

Let’s see if message length is a factor for success:

message_len = pradio %>%
  group_by(Year, DPos) %>%
  summarize(meanLength = mean(wordLength))

ggplot(message_len, aes(x = DPos, y = meanLength, color = Year)) +
  geom_point() +
  labs(
    title = "Avg. Message Length (Words) by Championship Position",
    x = "Position (Driver Championship)",
    y = "Mean message length (Words)"
  ) +
  geom_smooth(se=FALSE) +
  xlim(1,24) +
  scale_x_reverse() +
  theme_minimal() +
  theme(legend.position = "bottom")
One might think that worse drivers need slightly more communication — or, that longer communication might lead to worse performance;
I’m not sure that the effect is significant enough here.
message_count = pradio %>%
  group_by(Year, DPos) %>%
  summarize(NumMessages = n())

ggplot(message_count, aes(x = DPos, y = NumMessages, color = Year)) +
  geom_point() +
  labs(title = "Avg. Number of Messages by Championship Position", x = "Position (Driver Championship)", y = "Number of Messages broadcast") +
  geom_smooth(se=FALSE) +
  xlim(1,24) +
  scale_x_reverse() +
  theme_minimal() +
  theme(legend.position = "bottom")
There appears to be a correlation between Messages sent and Championship Position; the drivers at the front of the grid are broadcast more frequently.
In more recent years, this effect seemingly shrinks, providing a slightly fairer selection of messages.

Word Clouds

Let’s create word clouds to see the most frequent words for each team:

team_data_frames = pradio %>%
  group_split(Team)

team_freq = lapply(team_data_frames, function(team_df) {
  corpus = Corpus(VectorSource(team_df$Message))
  corpus = tm_map(corpus, content_transformer(tolower))
  corpus = tm_map(corpus, removePunctuation)
  corpus = tm_map(corpus, removeNumbers)
  corpus = tm_map(corpus, removeWords, stopwords("english"))
  term_freq = TermDocumentMatrix(corpus)
  term_freq = as.matrix(term_freq)
  word_freq = rowSums(term_freq)
  word_freq = sort(word_freq, decreasing = TRUE)
  word_freq = data.frame(word = names(word_freq), freq = word_freq)
  return(word_freq)
})

Ferrari

Mercedes

Red Bull

McLaren

Since we have lap numbers, let’s see how the topics and words used changed throughout the race:

scopes = list(
  c('PR', 'FL'), # Pre-Race
  seq(1, 30),    # Laps 1-30
  seq(31, 78),   # Laps >30
  'VL'           # Victory Lap
)

Before the race

Drivers drive a Formation Lap to warm up their tyres, practise for the start and line up on the grid.

Messages obviously focus on the importance of a good start and the launching of car using the hand-operated clutch, as well as getting the right temperature in the front and back tyres and charging the KERS (Kinetic Energy Recovery System).
The possibly race-deciding turns one and two are also mentioned, as well as conditions of the track (rain, wind and temperature).

The first half of the race

The second half of the race

There’s no real significant difference between the two halves of the race; perhaps between most races aren’t divided into two, but into smaller stints, from tyre to tyre.
Laps are mentioned quite frequently because times for each lap are constantly being compared, both personally and against other drivers. The box is high up in rank because it is repeated to assure understanding, as pitting at the right time has a great effect on one’s race. Tyres and pace need to be monitored closely.

Victory Lap

The Victory Lap or cooldown lap serves to return drivers to the pit entry, since they pass it on the way to the finish line. In the broadcast, it’s mostly used as an opportunity for Driver and Team to celebrate.

As expected, we observe overwhelmingly positive utterances, as the broadcast focuses on the winner. Good Job, indeed!
Little snippets of the other places can also be found: “sorry”, “shame”, and “tough”.

Message Content: Team vs Driver

What do both parties focus on?

party_data_frames = pradio %>%
  group_split(Dir)

party_freq = lapply(party_data_frames, function(df) {
  corpus = Corpus(VectorSource(df$Message))
  corpus = tm_map(corpus, content_transformer(tolower))
  corpus = tm_map(corpus, removePunctuation)
  corpus = tm_map(corpus, removeWords, stopwords("english"))
  term_freq = TermDocumentMatrix(corpus)
  term_freq = as.matrix(term_freq)
  word_freq = rowSums(term_freq)
  word_freq = sort(word_freq, decreasing = TRUE)
  word_freq = data.frame(word = names(word_freq), freq = word_freq)
  return(word_freq)
})

Driver to Team

wordcloud2(party_freq[[1]], color ="darkgrey")

Team to Driver

wordcloud2(party_freq[[2]], color ="steelblue")
Drivers communicate more emotionally (spot the censored up there?), using less precise langugae. They share thoughts on their experience in the car (tyres, brakes, wings, balance between front and rear and so on), while the team give more feedback (good job once again), ask whether things can be done and state what needs to be done.

Sentiment Analysis

Let’s compare lexicons to see which will give better results with our data.

sentiments = get_sentiments("bing")

# Tokenize the text messages
pradio_tokens = pradio %>%
  unnest_tokens(word, Message)

pradio_sentiment = pradio_tokens %>%
  inner_join(sentiments, by = c(word = "word"))

kable(head(pradio))
Driver Dir Team Message Lap GP Year DPos wordLength
Lewis Hamilton F Mercedes It’s quite windy out there. PR Abu Dhabi 2017 1 6
Lewis Hamilton T Mercedes To Hamilton: Still headwind turn two, it may have rotated compared to yesterday so more of a tailwind turn eight. Still predominantly the same direction, though. PR Abu Dhabi 2017 1 26
Romain Grosjean F Haas Grosjean: Thanks for waiting, I’ve finished my drink. PR Abu Dhabi 2017 13 9
Romain Grosjean T Haas No problem, that was a big drink. PR Abu Dhabi 2017 13 7
Lewis Hamilton F Mercedes Is the temperature still dropping? PR Abu Dhabi 2017 1 5
Lewis Hamilton T Mercedes Yeah it should be dropping. The sun’s just disappearing behind the grandstand. It should be down. PR Abu Dhabi 2017 1 17
nrow(pradio_sentiment)
## [1] 14861
kable(head(pradio_sentiment))
Driver Dir Team Lap GP Year DPos wordLength word sentiment
Romain Grosjean T Haas PR Abu Dhabi 2017 13 7 problem negative
Fernando Alonso T McLaren PR Abu Dhabi 2017 15 17 soft positive
Fernando Alonso T McLaren PR Abu Dhabi 2017 15 17 soft positive
Lewis Hamilton F Mercedes FL Abu Dhabi 2017 1 6 fumes negative
Lewis Hamilton F Mercedes FL Abu Dhabi 2017 1 6 pretty positive
Lewis Hamilton F Mercedes FL Abu Dhabi 2017 1 6 strong positive
sentiments = get_sentiments("afinn")

pradio_sentiment = pradio_tokens %>%
  inner_join(sentiments, by = c(word = "word"))

pradioTop10_sentiment = pradioTop10 %>%
  unnest_tokens(word, Message) %>%
  inner_join(sentiments, by = c(word = "word"))

nrow(pradio_sentiment)
## [1] 15491
kable(head(pradio_sentiment))
Driver Dir Team Lap GP Year DPos wordLength word value
Romain Grosjean F Haas PR Abu Dhabi 2017 13 9 thanks 2
Romain Grosjean T Haas PR Abu Dhabi 2017 13 7 no -1
Romain Grosjean T Haas PR Abu Dhabi 2017 13 7 problem -2
Romain Grosjean T Haas PR Abu Dhabi 2017 13 7 big 1
Lewis Hamilton T Mercedes PR Abu Dhabi 2017 1 17 yeah 1
Lewis Hamilton F Mercedes FL Abu Dhabi 2017 1 6 pretty 1
AFINN performs slightly better and also gives us more detail to work with. Let’s go with that!
ggplot(pradio_sentiment, aes(x = Team, y = value, fill = Team)) +
  geom_boxplot(outlier.color="black", outlier.shape=16, outlier.size = 2) +
  labs(title = "Sentiment per Team", x = "Team", y = "Sentiment Score (AFINN)") +
  scale_fill_manual(values=teamColors) +
  theme_minimal() +
  theme(legend.position =  "none") + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))

Every team uses pore positive than negative language. Williams seem to be in the best mood, while Renault have more bad days than the rest of the field.
ggplot(pradioTop10_sentiment, aes(x = Driver, y = value)) +
  geom_boxplot(outlier.color="black", outlier.shape=16, outlier.size = 2) +
  labs(title = "Sentiment of pradioTop10 Drivers", x = "Driver", y = "Sentiment Score (AFINN)") +
  theme_minimal() +
  theme(legend.position = "bottom") + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))

pradio_sentiment$Lap <- ifelse(pradio_sentiment$Lap == "PR", 0,
                              ifelse(pradio_sentiment$Lap == "FL", 0,
                                ifelse(pradio_sentiment$Lap == "VL", max(pradio_sentiment$Lap <80),
                                      pradio_sentiment$Lap)))

pradio_sentiment$Lap <- as.integer(pradio_sentiment$Lap)

average_lap_sentiment <- pradio_sentiment %>%
  group_by(Lap, GP) %>%
  summarize(avg_sentiment = mean(value))

average_lap_sentiment %>%
  ggplot(aes(x=Lap, y=avg_sentiment, group=GP, fill=GP)) +
    geom_line(linewidth = .2, lineend = "round", color="black") +
    geom_smooth(method="lm", linewidth = .3, lineend = "round", color="red") +
    labs(title = "Average Sentiment during the Race per GP", x = "Lap", y = "Average Sentiment Score (AFINN)") +
    theme_minimal() + 
    theme(
      axis.text.x = element_text(angle = 45, vjust = 1, hjust=1),
      legend.position="none",
      panel.spacing = unit(0.1, "lines"),
      strip.text.x = element_text(size = 8),
      plot.title = element_text(size=14)
    ) +
    facet_wrap(~GP, ncol=4)

Over the season, the happiest Grands Prix seem to be in Abu Dhabi, Belgium and Korea, while the European GP continually got worse. A greater amplitude of sentiment could be interpreted as more exciting for the viewer, as there are more extreme emotions at play.
average_team_sentiment <- pradio_sentiment %>%
  filter(!is.na(Team)) %>%
  group_by(Team, Year) %>%
  summarize(avg_sentiment = mean(value))

ggplot(average_team_sentiment, aes(x = Year, y = avg_sentiment, group = Team, color = Team)) +
  geom_line(linewidth = 2, lineend = "round") +
  labs(title = "Average Sentiment per Team over the Years", x = "Year", y = "Average Sentiment Score (AFINN)") +
  scale_color_manual(values = teamColors) +
  theme_minimal() +
  theme(legend.position = "bottom")

Red Bull is happiest in 2013, the last year in our data where they won both championships; Mercedes, starting their dominance in the championship, score fourth happiest, then continue being second.
team_sentiment_years = average_team_sentiment %>%
  left_join(pradio_sentiment, by = c("Team", "Year"))

ggplot(team_sentiment_years, aes(y = avg_sentiment, x = DPos, color = Team)) +
  geom_point() +
  geom_smooth(method = "lm", color="red", se=TRUE) +
  labs(title = "Correlation between Average Sentiment and Driver Position", y = "Average Sentiment Score (AFINN)", x = "Driver Position") +
  theme_minimal() +
  scale_x_reverse() +
  scale_color_manual(values = teamColors) +
  theme(legend.position = "bottom")

There appears to be a slight correlation between driver position and Sentiment, but it’s smaller than I expected. Williams, who’ve been consistently happy, are about as content with P17 and P18 as Mercedes are with P1 and P2, while Ferrari lie well below the expected sentiment for their positions.